A pseudo-value regression approach for differential network analysis of co-expression data

Background The differential network (DN) analysis identifies changes in measures of association among genes under two or more experimental conditions. In this article, we introduce a pseudo-value regression approach for network analysis (PRANA). This is a novel method of differential network analysis that also adjusts for additional clinical covariates. We start from mutual information criteria, followed by pseudo-value calculations, which are then entered into a robust regression model. Results This article assesses the model performances of PRANA in a multivariable setting, followed by a comparison to dnapath and DINGO in both univariable and multivariable settings through variety of simulations. Performance in terms of precision, recall, and F1 score of differentially connected (DC) genes is assessed. By and large, PRANA outperformed dnapath and DINGO, neither of which is equipped to adjust for available covariates such as patient-age. Lastly, we employ PRANA in a real data application from the Gene Expression Omnibus database to identify DC genes that are associated with chronic obstructive pulmonary disease to demonstrate its utility. Conclusion To the best of our knowledge, this is the first attempt of utilizing a regression modeling for DN analysis by collective gene expression levels between two or more groups with the inclusion of additional clinical covariates. By and large, adjusting for available covariates improves accuracy of a DN analysis.

progressed. The regression modelling for RNA-seq differential expression (DE) analysis has been established to compare the number of DE genes under different biological or clinical states, including linear model based limma [5], negative binomial model based edgeR [6], or Poisson log-linear model approach [7]. The analysis of DE has a major limitation in that it looks at one gene at a time, even though a set of genes are often involved in the same biological process [8,9]. In contrast, the differential network (DN) analysis complements the DE analysis [10] by looking at genes collectively.
The DN analysis identifies changes in measures of association (i.e. network properties or topologies) of the networks across biological conditions, which makes it distinct from a single network analysis. Several groups have proposed statistical methods for DN analysis [11][12][13][14]. In particular, DINGO [11] and dnapath [14] have developed methods for RNA-seq data, to find differentially connected (DC) genes in subnetworks corresponding to different pathways, between two groups of patients; e.g. 'high-risk' versus 'low-risk' or 'long-term survivors' vs. 'short-term survivors. ' While these methods are convenient to use and applicable, they do not consider other observed covariates that may be associated with gene expression.
For instance, a previous study has shown that the expression levels of oxidative stressassociated genes were differentially expressed with smokers with chronic obstructive pulmonary disease (COPD) through gene set enrichment analysis using microarray data [15]. Let us suppose we want to carry out DN analysis on expression data that includes oxidative stress-associated genes and smoking status, which would be used as a grouping variable. In practice, clinicians would also want to include additional covariates such as patient history of cardiac arrhythmia [16] and lung carcinoma [17] to garner more information for better prognosis. However, there is no available direct regression modeling for DN analysis regressing gene expression level between the smoking statuses with the inclusion of additional clinical covariates described above.
The pseudo-value approach was first developed from the leave-one-out jackknife subsampling procedure, applied to a marginal quantity representing some aspect of a marginal distribution of the response variable. It was originally introduced by Andersen and his colleagues [18,19] for multi-state survival models. Several studies [20,21] purported that the pseudo-value regression has advantages that the pseudo-values derived from an asymptotically linear and unbiased estimator are approximately independent and identically distributed with the same conditional expectation. Ahn and Logan [22] and Ahn and Mendolia [23] showed that their pseudo-value approaches controlled the type I error while maintaining high power with clustered survival data. With these benefits, we propose a regression modeling method that regresses the jackknife pseudo-values [24] derived from a measure of connectivity of genes in a network to estimate the effects of predictors. Note that the grouping variable itself could also be included in the regression model along with additional clinical covariates while regressing the pseudo-values. We loosely refer to this as a "multivariate setting", whereas in "univariate settings" only the group variable is utilized in a DN analysis.
Thus, in this paper, we introduce a Pseudo-value Regression Approach for Network Analysis (PRANA). This is a novel method of DN analysis that can adjust for additional covariates. We start from mutual information (MI) criteria, followed by pseudo-value calculations, which are then entered into a robust regression model. This article assesses the model performances of our pseudo-value approach in a multivariable setting, followed by a comparison to dnapath and DINGO in the univariable setting through simulations. Lastly, we employ our method in a real data application [25] from the Gene Expression Omnibus (GEO) database [26] to identify DC genes that are associated with COPD. All statistical analyses are performed in R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria).

Simulation study
More details of the simulation setup are available in the "Materials" section below. We select p = 20, 50, 100 genes to test our pseudo-value approach in smaller to larger gene networks. For each gene network, five different sample sizes n = 40, 100, 200, 500, 1000 are considered and in each setting, 1000 Monte Carlo replicates. We draw 1000 random samples first, then take the subsamples from this pool for a simulation with a smaller sample size to reduce computational burden. A random network is generated at each simulation replicate in which a layer of randomness is imposed to account for biological variability of the network structure. For additional details on the generation of simulated RNA-seq data, see the "Materials" section. Simulations are repeated to show the performance of our method by altering the effect size from 5%, 10%, to 20% for simulation scenarios I and II.
Results are compared with the true parent network in order to compute the performance measures described in the "Performance Measures" section. In the true network setting, a gene is considered truly DC between groups if it has at least one DC edge connected to other genes. Tables 1 and 2 summarize simulation results in the multivariable setting for scenarios I and II, respectively, when the continuous variable is added as a covariate with the binary group variable in the regression. Table 2 incorporates the effect of covariate when generating random networks, whereas Table 1 does not. For both scenarios, results show that the pseudo-value regression method generally yields a high precision and recall across all specifications of network size, sample size, and effect size. The pseudo-value regression method maintains a high precision while having an acceptable recall, especially, when a smaller sample size is considered. Tables 3 and 4 summarize simulation results for scenarios I and II, respectively, when only the binary group variable is included in the model for pseudo-value calculation. Thus, age dependent networks are simulated for Table 4 but not for Table 3. Two competing univariable methods, dnapath and DINGO are included for these simulations.
Overall, a similar pattern is observed in the univariable setting; i.e., PRANA consistently reaches a high precision and recall. The performance improves as n increases, as to be expected. It is noteworthy that PRANA outperforms dnapath in simulation when the sample size is relatively small regardless of the network size. Our method also shows a better recall value and F1-score than dnapath with small sample sizes ( n = 40, 100 ). As DINGO requires substantially large computational time, it was considered for the gene network with smaller sample sizes only. To be more specific, simulations with larger sample sizes ( n = 500, 1000 ) are stopped after 20 days for DINGO from the University of Florida Research Computing Linux server, HiPerGator 3.0 with 10 CPU cores and 10 GB of RAM per node. See Table S1 in Additional file 1 for more details. Table 5 presents results of scenario III, where age acts as a confounder. That is, an observed difference in connectivity may be due to a difference in the distribution of age in the two groups. Higher precision values from the multivariable pseudo-value regression indicate that PRANA correctly identifies the DC genes, compared with dnapath and DINGO, neither of which accounts for the effects of age. By and large, PRANA has higher precision than DINGO and higher recall than dnapath.
Of the 23 DC genes from PRANA, 5 genes are found exclusive to PRANA (DMWD, MED13L, TNPO1, THRA, and STN1). Notably, DMWD is linked to myotonic dystrophy, a rare genetic muscular disorder [27]. Thyroid hormone receptor alpha (THRA) is related to congenital hypothyroidism [28]. These findings about additional genes will facilitate harnessing of the possible mechanisms at work in COPD exacerbation.
Heat shock protein family A (Hsp70) member 4 (HSPA4) is associated with gastric ulcer [29]. Multifunctional ROCO family signaling regulator 1 (MFHAS1) is linked to soft tissue tumor and cell cycle [30]. HSPA4 and MFHAS1 are DC genes identified in both PRANA and DINGO. Echinoderm microtubule-associated protein-like 4 (EML4) is found in all three methods. It has been studied for its association with lung cancer [30,31]. A Venn diagram is provided to show the overlap between and among three methods (Fig. 4). In addition, a diagram is included to summarize the findings of this application study (Fig. 5).

Discussion
Simulations and real-data analysis have elucidated that PRANA is superior to existing alternatives and a practical tool, which includes covariates in the model. To the best of our knowledge, this is the first attempt to develop a regression modeling in DN analysis. Our working objective is to propose a statistical method that determines whether a gene is significantly DC between groups with the covariate included in the model. In this paper, we have shown through simulations that PRANA reaches a consistently high degree of precision and recall to identify DC genes with varying simulation parameters such as network size, sample size, and effect size. We also analyzed a COPD-related gene expression data from the GEO database. When comparing results from our method to dnapath and DINGO, five COPD-related genes are additionally found DC between current versus non-current smokers: DMWD, MED13L, TNPO1, THRA, and STN1.
There are a number of limitations to be highlighted in this study. We have used the absolute value of the differences between the two adjacency matrices as a proxy to determine the true DC genes. Certainly, this is a practical way to detect differences in the number of edges for each genes in a network. The comparison of maximum values between adjacency matrices was also considered. However, we concluded that they are more useful describing the global characteristic of a network, which deviates from our objective, namely, a gene-specific characteristic of a network.
Another limitation is the inability to perturb simulated networks in a continuous way. Right now, we have discretized the effect of a covariate into three groups. Perhaps, there are other models where a truly continuous covariate could be incorporated.
Lastly, the Pearson correlation, partial correlation, and degree-weighted LASSO were also examined as alternatives to the ARACNE as a measure of association or connectedness, albeit not reported in the paper, due to relatively poor performance and heavy computational costs. It remains an interesting task for future studies to extend our work to other measures of association of a network which better assess different structural changes in the network. We conclude by foregrounding the future direction of the pseudo-value regression approach for the DN analysis, which are potentially extensible to other data types, such as the microbiome data.

Conclusion
The adjustment of covariate is an important step in differential network analysis. In this paper, we presented PRANA, a novel pseudo-value regression approach for the DN analysis, which can incorporate additional clinical covariates in the model. This is a direct regression modeling, and it is therefore computationally amenable for the most users.

Mutual information and ARACNE
Mutual information (MI) determines whether and how two genes interact. That is, it is a measure of their relatedness and calculated from their joint expression profiles. MI is zero if and only if the joint distribution between the expression level of gene j and gene k satisfies P(g j , g k ) = P(g j )P(g k ) for j = k , or if j = k ; in other words, they are statistically independent. Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) which are proposed by [32,33] estimates MI using a computationally efficient Gaussian kernel estimator. The estimate of MI is used to quantify the connectivity of each pair of genes in a network. Given a set of two genes measurements, − → u i ≡ (g ij , g ik ) , i = 1, . . . , n , the joint probability distribution is approximated as where is the bivariate standard normal density and h is the position-dependent kernel width. Then the MI can be expressed as [32]: where f (g j ) and f (g k ) are the marginals of f ( − → u ) . The matrix containing entries Î jk is defined as the association matrix. The ARACNE algorithm copula-transforms the profiles for MI estimation because MI is reparameterization invariant; thus, the range of these transformed variables is between 0 and 1 [34].

Pseudo-value approach
Let Î jk be the MI estimate for a pair of genes j, k ∈ {1, . . . , p} of an estimated network from n individuals. For each gene k, we sum the edges (MI estimates) around the gene by taking the column sum of the association matrix to obtain the total connectivity (which can be deemed as a continuous version of degree centrality) of gene k: The jackknife pseudo-values [24] for the i th individual and k th gene are defined by: where θ k(i) is the column sum of a gene calculated from the re-estimated association matrix using the RNA-seq data without the i th subject. For each gene k, the re-estimation process requires n such calculations with the data size of n − 1.
Let Z a binary group indicator. Let G 1 = {i : Z i = 1} , G 2 = {i : Z i = 2} , and n z = |G z | is the sample size for the two groups z = 1, 2 and n = n z . The jackknife pseudo-values are separately obtained within groups. Following the general formula above, for gene k and group z, we similarly define θ z k and θ z k(i) , where i = 1, . . . , n z . Then for each i ∈ G z , the k th gene jackknife pseudo-values are calculated by θ ik = n zθ z k − (n z − 1)θ z k(i) . Next, a robust regression is applied to regress the pseudo-values on a set of covariates, including Z and X , where Z is the group indicator and X = (X 1 , . . . , X q ) are the potential confounders, such as age and gender. For the i th individual and k th gene, we posit the model where α k is the intercept, β k is the regression coefficient for Z, and γ k1 , . . . , γ kq is the set of regression coefficients to be estimated for X. The main parameter of interest is β k to test for the change in total connectivity (or degree centrality) of the k th gene between groups. Least trimmed squares (LTS), also known as least trimmed sum of squares [35], is then implemented to perform a robust regression. The LTS estimator is defined by where r (i) is the set of ordered absolute values of the residuals (in increasing order of absolute value) and h may depend on some pre-defined trimming proportion c, for instance by means of h = [n(1 − c)] + 1 . In general, c is chosen between 0.5 and 1 [36].

Hypothesis testing
To test whether the true difference in total connectivity of k th gene differs between groups, we test the null hypothesis of H 0 : β k = 0 against the research hypothesis H 1 : β k � = 0 . The t-statistic is computed by β k /SE(β k ) , where SE(β k ) standard error of β k , obtained using large sample theory, and which is the least-squares estimator of β k for k = 1, . . . , p from the robust regression described in equation (2). P-values are calculated using a t-distribution as in robustbase R package [37,38].
It is important to control the false discovery rate (FDR), since multiple hypothesis tests are conducted in the DN analysis. The FDR measures the proportion of false discoveries among a set of genes which are significantly DC between groups. The empirical Bayes screening (EBS) approach [39] has been applied to control the FDR, which is an extension of Westfall and Young step-down adjusted p-values [40]. The EBS procedure is robust against model mis-specification, as it utilizes nonparametric function estimation techniques for the estimation of the marginal density of the transformed p-values.

Materials
This section details step-by-step procedures how the simulation is performed. The performance of our proposed method is assessed by an extensive simulation study. Data are simulated with different number of genes p and sample size n. In this simulation, the regression model includes two covariates Z and X, where Z is the group indicator and X ∼ N (55, 10) is a continuous covariate such as the age of a patient. Three different simulation scenarios are considered.

Data generation
Simulate weighted networks and RNA-seq data with a dependence structure that depends on Z and/or X using the SeqNet R package [41]. In this setting, there are total of six networks for the combination of two groups and three age categories (younger than 50, 50-60, and older than 60). We consider three different scenarios incorporating group information only (scenario I), age and group information (scenario II), and age and group information with unequal sampling proportions with different distributions of the age in the two groups (scenario III) (see Figures 1-3

Scenarios I (a-b) and II (a-c)
a. Generate the first random network with p nodes for z = 1 . The p × p adjacency matrix, where the diagonal elements are 0 and non-diagonal elements are in {0, 1} , is extracted from this first graph. It is a symmetric matrix indicating whether a pair of nodes are connected by an edge. Take the column sum of the adjacency matrix to see the total number of connected edges to the node. Record the indices of this vector with column sum for the use of effect size adjustment in later step. b. Perturb the first network to generate the second network for z = 2 by removing the edges around nodes using the indices obtained in previous step. To assess the effect size of group, the top 5% , 10% , and 20% of total nodes with the most number of edges in a network lose their edges (e.g. 2 nodes with the most number of edges for a network with p = 20 for the effect size of 10% , Fig. 1). This is the end of scenario I. c. For scenario II, further perturb remaining networks by removing edges of one additional node with the next most number of edges, coming after Step (b) above. This is to simulate networks with a covariate dependence structure on both age and group (see Fig. 2).

Fig. 1
Network plots visualizing the gene network ( p = 20 ) without a covariate dependence structure that depends on binary group only (scenario I). The row represents group whereas the column represents age categories. The three networks in each row are identical, since there is no effect of age on the structure of network. The edges of the hub nodes are removed based on the effect size of the binary group Fig. 2 Network plots visualizing the gene network ( p = 20 ) with a covariate dependence structure that depends on age and group information (scenario II). The row represents group whereas the column represents age categories. All six networks have unique structure of the network. The edges of the hub nodes are firstly removed based on the effect size of the group, as shown in Fig. 1

Scenario III
We created a scenario where age is acting like a confounder. In other words, for a given each category that two networks are the same, but the distributions of the age of the patients are different in the two groups. Therefore, there will be an observed difference in network connectivity, which is explained through age.

Fig. 3
Network plots visualizing the gene network ( p = 20 ) with a covariate dependence structure that depends on age and group information with unequal sampling proportions with respect to different distribution of the age in the two groups (scenario III). The row represents group whereas the column represents age categories. All six networks have unique structure of the network. The edges of the two hub nodes are removed for each age category. To employ the effect of group, 10%/10%/80% of the subjects in z = 1 will have a network structure to each of the first, second, and third networks in the first row. In contrast, 80%/10%/10% of the subjects in z = 2 will have a network structure to each of the first, second, and third networks in the second row Fig. 4 A Venn diagram displaying the number of overlapping DC genes between and among univariable analysis such as DINGO and dnapath versus multivariable robust regression with pseudo-value approach using COPDGene study data from GEO database a. Generate the first random network with p nodes for younger than 50 category. The p × p adjacency matrix, where the diagonal elements are 0 and non-diagonal elements are either {0, 1} , is extracted from this first graph. Record the indices of connected edges for the perturbation of network in later steps below. b. Perturb the first network to generate the second network for age 50-60 category by removing the edges of the two nodes with the most number of edges in a network lose all of their edges. In other words, we refer to the indices, recorded in the adjacency matrix from the earlier step, and remove all the connected edges around the two nodes. c. Next, we repeat the same to perturb the second network to obtain the third network for older than 60 category (see Fig. 3).
For all three scenarios, we assign a partial correlation to edges to obtain weighted networks [41]. Note that adjacency matrices of these weighted networks are used for the the true connection per gene. Generate RNA-seq samples based on weighted networks with equal sampling proportions for scenarios I and II. However, specifically for scenario III, a sampling proportion differs across age categories and groups. That is, 10%/10%/80% for z = 1 and 80%/10%/10% for z = 2 . The data generation involves with two major steps. Firstly, we generate gene expressions (Gaussian values) from a group-specific weighted network for each gene, denoted as x i ∼ N (0, 1) . These Gaussian values are then mapped into RNA-seq data column-wise by using the inverse CDF of empirical distribution of the reference data using expression data with accession number GSE158699 [25] from the Gene Expression Omnibus (GEO) database [26]. We will have n z × p matrices for each group z = 1, 2.

Algorithm
1 Obtain an association matrix with ARACNE from the data generated in steps from the "Data Generation" section to fit an estimated network using minet [32,42] for each group. 2 For each gene k, calculate the column sums of association matrix for each group z separately, denoted by θ z k . 3 For each gene k and individual i ∈ G z , compute θ z k(i) from the association matrix that is re-estimated based on RNA-seq data without the ith subject of n z × p data from the "Data Generation" section for each group z separately, where i = 1, . . . , n z . 4 Calculate θ ik using Eq. (1) based on Step 2 and 3. 5 For each gene k, fit a multivariable robust regression with binary group variable and continuous age variable to obtain the p values of the group variable, computed from the t test. These p values are used to compute the performance measures of simulation study. More details on the performance measures are stated next.

Performance measures
To evaluate the performance of our proposed method, precision, recall, and the F1 score are calculated. Let z ∈ R p×p be the adjacency matrix for group z, where for z = 1, 2 . Then, for each gene k, we calculate where I(·) is an indicator function to determine whether gene k has differential connectivity. The gene k is truly DC if η k = 1 , and is not DC if η k = 0 for the true gene network. Similarly, for the covariate dependence structure, the following quantities are obtained where z,c ∈ R p×p be the adjacency matrix for group z and age category c = 1, 2, 3 . Denote that S is the total number of Monte Carlo simulation replicates. Let q ks be adjusted p value as in following procedure [39] of kth gene at the sth simulation replicate. α represents the magnitude of error control, and 0.05 was used throughout the simulation.
• Precision is the proportion of genes that are inferred to be significantly DC from the test which have true connection between two comparing groups: • Recall is the proportion of genes that have true connection which are correctly inferred to be significantly DC between two comparing groups from the test: .